The Annals of Applied Statistics 
2012, Vol. 6, No. 4, 1838-1860 
DOI: 10.1214/12-AOAS570 

@ Institute of Mathematical Statistics, 2012 



COMPOSITE GAUSSIAN PROCESS MODELS FOR EMULATING 
EXPENSIVE FUNCTIONS^ 

By Shan Ba and V. Roshan Joseph 

Georgia Institute of Technology 

A new type of nonstationary Gaussian process model is devel- 
oped for approximating computationally expensive functions. The 
new model is a composite of two Gaussian processes, where the first 
one captures the smooth global trend and the second one models lo- 
cal details. The new predictor also incorporates a flexible variance 
model, which makes it more capable of approximating surfaces with 
varying volatility. Compared to the commonly used stationary Gaus- 
sian process model, the new predictor is numerically more stable and 
can more accurately approximate complex surfaces when the experi- 
mental design is sparse. In addition, the new model can also improve 
the prediction intervals by quantifying the change of local variability 
associated with the response. Advantages of the new predictor are 
demonstrated using several examples. 

1. Introduction. The modern era witnesses the prosperity of computer 
experiments, which play a critical role in many fields of technological devel- 
opment where the traditional physical experiments are infeasible or unafford- 
able to conduct. By developing sophisticated computer simulators, people 
are able to evaluate, optimize and test complex engineering systems even 
before building expensive prototypes. The computer simulations are usu- 
ally deterministic (no random error), yield highly nonlinear response sur- 
faces, and are very time-consuming to run. To facilitate the analysis and 
optimization of the underlying system, surrogate models (or emulators) are 
often fitted to approximate the unknown simulated surface based on a fi- 
nite number of evaluations [Sacks et al. (1989)]. Santner, Williams and Notz 
(2003) and Fang, Li and Sudjianto (2006) provide detailed reviews on the 
related topics. 



Received August 2011; revised May 2012. 

^Supported by U.S. National Science Foundation Grants CMMI-0448774 and DMS-10- 
07574. 

Key words and phrases. Computer experiments, functional approximation, kriging, 
nugget, nonstationary Gaussian process. 

This is an electronic reprint of the original article published by the 

Institute of Mathematical Statistics in The Annals of Applied Statistics, 

2012, Vol. 6, No. 4, 1838-1860. This reprint differs from the original in pagination 

and typographic detail. 



1 



2 



S. BA AND V. R. JOSEPH 



In computer experiments, the stationary Gaussian process (GP) model is 
popularly used for approximating computationally expensive simulations. 
Its framework is built on modeling the computer outputs Y(x),x G M^, 
as a realization of a stationary GP with constant mean /x and covariance 
function cr^ cov(y(x + h), y(x)) =cj^i2(h), where the correlation R{h.) is a 
positive semidefinite function with i?(0) = 1 and R{—h) = i2(h). When the 
above assumptions are satisfied, the corresponding predictor can be shown 
to be a best linear unbiased predictor (BLUP), in the sense that it mini- 
mizes the mean squared prediction error. Nevertheless, many studies in the 
literature have pointed out that the artificial assumption of second-order 
stationarity for the GP model are more for theoretical convenience rather 
than for representing reality, and they can be easily challenged in practice. 
If these assumptions deviate from the truth, the predictor is no longer op- 
timal, and sometimes can even be problematic [see the discussions, e.g., in 
Joseph (2006), Xiong et al. (2007), Gramacy and Lee (2012)]. 

When the constant mean assumption for the GP model is violated, a 
frequently observed consequence is that the predictor tends to revert to 
the global mean, especially at locations far from design points. Consider a 
simple example from Xiong et al. (2007). Suppose the true function is y{x) = 
sin(30(x — 0.9)'*) cos(2(x — 0.9)) + (x — 0.9)/2 and we choose 17 unequally 
spaced points from [0, 1] to evaluate the function. The function and design 
points are illustrated in Figure 1. Obviously, the mean of this function in 
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Fig. 1. Plot of function y{x) = sin(30(a; - 0.9)") cos(2(,x- - 0.9)) + {x - 0.9)/2, the global 
mean and the ordinary kriging predictor. 
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region x € [0, 0.4] is much smaller than the mean in region x € [0.4, 1]. When 
the data are fitted with a stationary GP model with a Gaussian correlation 
function, a constant mean for the whole region is estimated as —0.146 by 
maximizing the likelihood function [Santner, Williams and Notz (2003), page 
66] , and the corresponding predictor along with this mean value are shown in 
Figure 1. Clearly, the fit in region x E [0.4, 1] is not good, since the prediction 
is pulled down to the global mean. 

Just as a nonconstant global trend can be quite common in engineering 
systems, the variability of simulated outputs can also change dramatically 
throughout the design region. Still, consider the simple case in Figure 1, for 
example: the roughness of the one-dimensional function in region x G [0,0.4] 
is much larger than in region x € [0.4, 1] . For the GP model assuming a con- 
stant variance for the whole input region, the variance estimate for region 
X € [0.4, 1] tends to be inflated by averaging with that of the other part, 
which further contributes to the erratic prediction in this region. It is ex- 
pected that as we increase the simulation sample size, the above problem 
can be mitigated. However, since most typical applications of computer ex- 
periments involve high-dimensional inputs, the data points always tend to 
be sparse in the design region and it is almost impossible to avoid such kind 
of gaps in practice. 

In this article, we propose a more accurate modeling approach by incor- 
porating a flexible global trend and a variance model into the GP model. 
The proposed predictor has an intuitive structure and can be efficiently 
estimated in a single stage. Not only can the new predictor mitigate the 
problems discussed above, it also enjoys several additional advantages, such 
as better numerical stability, robustness to sparse design and improved pre- 
diction intervals. 

The article is organized as follows. Section 2 introduces the notation and 
existing work. Section 3 presents the new predictor and shows its interest- 
ing connections with some existing methods. In Section 4 we discuss how to 
estimate the unknown parameters by maximum likelihood. Several proper- 
ties of the new predictor are studied in Section 5, and in Section 6 we use 
several examples to demonstrate the advantages of the new method. Some 
final concluding remarks are given in Section 7. 

2. Notation and existing work. In the computer experiments literature, 
the GP model is also often referred to as the kriging model [Currin et al. 
(1991)], and these two terms are used interchangeably in this article. Suppose 
we have run the simulations under n different input settings {xi, . . . ,x„} C 
R''. Denote the corresponding computer outputs as y = (yi, . . . , yn)^ ■ A sta- 
tionary GP model, called ordinary kriging, can be formally stated as 
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where ^(x) --^ GP(0, (T^i?(-)). The ordinary kriging predictor at an input 
location x is given by 

(2) y(x)=/i + rT(x)R-i(y-/il), 

where r(x) = (i?(x — xi), . . . , ii(x — x.„))^, R is an n x ?i correlation matrix 
with the (ij)th element i?(xj — Xj), 1 is a n-dimensional vector with all 

elements 1, and /i = (l"^R~il)~i(l^R-V)- 

To remedy the predictor's reversion to mean problem as discussed in the 
previous section, a common strategy is to relax the constant mean /u in 
ordinary kriging with a global trend /i(x) and modify the model in (1) as 

(3) y(x) = Mx) + z(x). 

If the global trend is comprised of some prescribed polynomial models /u(x) = 
f~''(x)/3, where f(x) = (1, /i(x), . . . , /^(x))''' are known functions and f3 = 
(/3o, /3i, . . . , /Sm)"*" are unknown parameters, the model in (3) is called uni- 
versal kriging. Define a n x {m + 1) matrix F = (f (xi), . . . , f (x„))^, and the 
corresponding optimal predictor under model (3) can be derived as 

(4) y(x)=f"^(x)^ + r^(x)R-i(y-F^), 

where ^ = (F^R-iF)-i(F^R-V)- If K^) = f^(x)/3 is close to the true 
global trend, then clearly this approach can give much better prediction 
than that of (2). However, in practice, the correct functional form f(x) = 
(1, /i(x), . . . , /m(x))''' is rarely known, and a wrongly specified trend in uni- 
versal kriging can make the prediction even worse. For this reason, Welch 
et al. (1992) suggested using ordinary kriging instead of universal kriging. 
Another practical approach, called blind kriging, is to relax the assumption 
that the fi (x) 's are known and select them from a candidate set of functions 
using a variable selection technique [Joseph, Hung and Sudjianto (2008)]. 
Although this strategy usually leads to better fit, performing the variable 
selection while interacting with the second stage GP model is a nontrivial 
task. Considerable computational efforts are needed to properly divide up 
the total variation between the polynomial trend and the GP model. In ad- 
dition, in some cases, polynomial models may not be adequate to fit the 
complex global trend well. 

Generalizing the GP model for nonstationary variance is an even more 
challenging task. None of the above remedies for the nonstationary mean 
can in any sense alleviate the constant variance restriction, and most studies 
in the literature focus on deriving complex nonstationary covariance func- 
tions such as by spatial deformations or kernel convolution approaches [e.g., 
see Sampson and Guttorp (1992), Higdon, Swall and Kern (1999), Schmidt 
and O'Hagan (2003), Paciorek and Schervish (2006) and Anderes and Stein 
(2008)]. However, those structures may easily get overparameterized in high 
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dimensions and become computationally intractable to fit. In addition, many 
of them also require multiple observations, which is not applicable to the 
single set of outputs from computer experiments. Some other work includes 
Xiong et al. (2007), which adopts a nonlinear mapping approach based on 
a parameterized density function to incorporate the nonstationary covari- 
ance structure. Gramacy and Lee (2008) utilize the Bayesian treed structure 
to implement a nonstationary GP model. However, by dividing the design 
space into subregions, the treed GP model may lose efficiency since the 
prediction is only based on local information, and its response can also be 
discontinuous across subregions. In the next section we propose to solve the 
nonstationarity problem via a different approach. We show that the flexi- 
ble mean and variance models can be incorporated into GP by using the 
composite Gaussian process (CGP) models. 

3. Composite Gaussian process models. For clarity, in this section we 
develop the new method in two steps. First, a predictor that intrinsically in- 
corporates a flexible mean model is presented, and then we further augment 
it with a variance model to simultaneously handle the change of variability 
in the response. 

3.1. Improving the mean model. The universal kriging (or blind kriging) 
in (3) contains a polynomial mean model /u(x) as the global trend and a 
kriging model Z{x.) for local adjustments. To avoid the awkward variable 
selections in fi{x) and also make the mean model more flexible, we propose 
to use another GP to model the fJ.{x) as in the following form: 

^{^) = ^global (x) + Ziocal(x), 
(5) Zgiobal(x)~GP(^,r25(.)), 

^iocai(x)~GP(0,a2/(.)). 

Here the two GPs ^global (x) and Ziocai(x) are stationary and independent 
of each other. The first GP with variance and correlation structure g{-) 
is required to be smoother to capture the global trend, while the second GP 
with variance o"^ and correlation /(•) is for local adjustments. Just as the 
universal kriging generalizes the ordinary kriging by adding a polynomial 
mean model /i(x), the new model in (5) can be viewed as a further exten- 
sion which adopts a more sophisticated GP for global trend modeling. It 
is interesting to note that the linear model of regionalization in geostatis- 
tics [Wackernagel (2003), Chapter 14] also employs a similar structure to 
model regionalized phenomena in geological data, but its final model form 
and estimation strategies are quite different from our approach. 

Under the new assumptions in (5), the optimal predictor is easy to derive. 
Since the sum of two independent GPs is still a GP, we can equivalently 
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express (5) as ^(x) ~ GP(/i,r^g(-) +(T^Z(-)). Similar to ordinary kriging, the 
best linear unbiased predictor under the assumptions in (5) can be written 
as 



where A = ci^/r^ (A € [0, 1]) is the ratio of variances, g(x) = (g(x — xi), . . . , 
^(x — x„))^, l(x) = (Z(x — xi), . . . , — x„))^, G and L are two n x n 
correlation matrices with the (ij)th element g{xi — x,) and /(xj — Xj), re- 
spectively, and ft = (1^(G + XL)-^1)-H'^ {G + AL)-V- Here the variance 
ratio A is restricted to [0, 1] because we expect the global trend to capture 
most of the variation in the response surface than the local process. 

Although many possible correlation structures are available for g{-) and 
throughout this paper we follow the standard choice in computer ex- 
periments and specify them using the Gaussian correlation functions: 



where 6 = {6i, . . . , 9p) and a = (qi, . . . , Up) are unknown correlation param- 
eters satisfying < < ct' and <cx. The bounds a' are usually set to 
be moderately large, which ensures that the component ^global (x) is indeed 
smoother than Ziocai(x) in the fitted model. 

The new predictor in (6) is still an interpolator, since y(xj) = jl + e^ [y — 
fil) = Ui for i = 1, . . . ,n, where is a unit vector with a 1 at its ith position. 
It can also be seen that when A = (i.e., cr^ = 0), the new model reduces to 
ordinary kriging. When A G (0, 1], the predictor in (6) can be written out as 
the sum of a global predictor and a local predictor 

(8) y(x) = yglobal(x) + yiocal(x). 



It is important to note that, since the lower bounds for a in (7) are usually 
set to be moderately large, the off-diagonal elements in L are closer to 
zero. Particularly, we can obtain L — )> I when a take very large values. This 
immediately suggests two interesting properties for the CGP model. First, 
its global trend predictor ygiobai(x) in (9) resembles a kriging predictor with 
nugget effect as L — )• I. When A > 0, this nugget predictor is smooth but 
noninterpolating, and is commonly used in spatial statistics for modeling 
observational data with noise [Cressie (1991)]. Second, since L ~ I, the A in 
(G + AL) is mainly added to the diagonal elements. This makes (G + AL) 
resistent to become ill-conditioned and the computation of (G + AL)~^ in 
CGP can be numerically very stable. These two properties are elaborated 
in detail in Section 5. 



(6) 



y(x) =fi+ (g(x) + Al(x)) ' (G + AL)"^(y - ^1) 




(9) 
(10) 



ygiobai(x) =/i + g^(x)(G + AL) \y-fll) 
yiocai(x) = Ar(x)(G + AL)-i(y - 
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3.2. Improving both the mean and variance models. To further relax the 
constant variance restriction, we introduce a variance model (T^(x) into (5) 
as follows: 

y(x) = 2'giobal(x) + 0-(x)Ziocal(x), 
(11) Zgiobal(x)~GP(^,T25(.)), 
Ziocal(x)~GP(0, /(•)). 

The .^giobai(x) above remains the same as in (5), since the global trend is 
smooth and can reasonably be assumed to be stationary. After subtracting 
Zgiobai(x) from the response, the second process is augmented with a vari- 
ance model to quantify the change of local variability such that (j(x)^}ocal (x) ^ 
GP(0, cr^(x)/(-)). Overall, the model form in (11) is equivalent to assuming 
that the response y(x) ~ GP(/i, r^g(-) + cr2(x)/(-)). 

Without loss of generality, suppose the variance model can be expressed 
as iT^(x) = a'^v{x), where <t^ is an unknown variance constant and f (x) is 
the standardized volatility function which fluctuates around the unit value. 
In the following discussion, we first assume that i'(x) is known and denote 
S = diag{f (xi), . . . ,v{xn)} to represent the standardized local variances at 
each of the design points {xi,...,x„}. An efficient strategy for obtaining 
the v{x) function is presented at the end of this section. 

The model assumptions in (11) suggest that y(x) and y = {yi, ■ ■ ■ ,yn)~^ 
have the multivariate normal distribution 

'y(x) 



A* 
/.I 



y 

(12) ~ Ni+n 



t2 + a^vix.) (r2g(x) + a^v^^^ix)!:^/^^)) 

The best linear unbiased predictor under these assumptions can be derived 



as 



y(x) = A + (T2g(x) + a2^;V2(x)5]i/2l(x))^(r2G + a^sV^LE^^)-! 

(13) X (y - fa) 

= fi + (g(x) + Ai;i/2(x)5]V2i(x))^(G + \J:^'^\.J:^l^)-\y - fil), 

where A = cjVt^ (A € [0,1]), ft = (l"^(G + {G + 

AS^/^LS^/^)~^y and all the other notation remain the same as in (6). Note 
that after defining the ratio A, the unknown cj^ is no longer needed for pre- 
diction, because the predictor depends on the variance model o"2(x) only 
through A and v(x). The predictor includes (6) as a special case when the 
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local volatility model f (x) degenerates to a constant function. The predictor 
can also interpolate all the data points since (g(xj) + Au^/^(xj)S^/^l(xj))~''(G + 
AS^/^LS^/^)-^ = e7 and y{xi) = fi + ej {y - fil) = yi for i = 1, . . . , n. By 
decomposing the predictor (13) into two parts 

(14) y(x) = yglobal(x) + yiocal(x), 

(15) ygiobai(x) = ft + g^(x)(G + ASi/2LsV2)-i(y _ ^1)^ 

(16) yiocai(x) = Xv'/\^)l'{^)i:'/'{G + ASi/2LSV2)-i(y _ ^i), 

we can see that the global trend ygiobai(x) in (15) reduces to a stochastic 
kriging predictor [Ankenman, Nelson and Staum (2010)] when L — )• I. Dif- 
ferent from the nugget predictor in (9) where a universal term A is used 
for adjusting the global trend throughout the whole region, the amount of 
shrinkage at each data point in (15) is proportional to the value of Au(xj). 
This localized adjustment scheme is advantageous in making the global trend 
smoother and more stable, since it is less affected by the data points with 
large variability. 

The above predictor form is derived based on ^(x) ~ GP(;U,T^g(-) + 
cr^(x)i(-)), which unifies the modeling assumptions (11) in a single stage. As 
a result, the new method can also be viewed as extending the kriging model 
with a nonstationary covariance structure T'^g{-) +ct^(x)/(-). Different from 
this, another strategy to fulfill the new assumptions in (11) is to develop the 
global and local models sequentially: (i) Fit a global trend model as in (15) 
using the likelihood method, (ii) Obtain its residuals s = (y — ygiobai)) where 
Ygiobai = (ygiobai(xi), . . . , ygioba^Xn,))"^. If the estimated global trend interpo- 
lates all the data points (A = 0), we have s = and in this case the CGP 
just degenerates to a traditional single GP model, (iii) If s 7^ 0, standardize 
the residuals to achieve variance homogeneity s* = 5]~^/^s. (iv) Adjust the 
global trend by interpolating the standardized residuals via a simple kriging 
model yadj(x) = l^(x)L~^s*. In this way, we can form a sequential predictor 
as 

(17) yseq(x) =yglobal(x)+W^/^(x)yadj(x) = ygiobal(x) + W^^^ (x)!""" (x)L~^S* . 

It is of natural interest to ask whether this sequential predictor would make 
any difference from the single-stage predictor (13), and the following theorem 
establishes their connections. 

Theorem 1. Given the same parameter values, the single-stage predic- 
tor (13) and the sequential predictor (17) are equivalent. 

Proof of the theorem is left in the Appendix. Despite this equivalent model 
form, we want to emphasize that the single-stage fitting strategy is superior 
to the sequential one in parameter estimation. This is because all parameters 
in the single-stage predictor (13) can be optimized simultaneously, which 
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takes into account the interactions between global and local models and 
automatically balances their effects. In contrast to this global optimization, 
the sequential fitting approach estimates the parameters in two separate 
steps, and each of them can at most achieve local optimality. Generally, 
the global trend is hard to identify correctly without considering the effects 
of the second stage model, and in many cases the performance of the final 
prediction can be quite sensitive to this "global-local trade-off." As a result, 
in this paper we only consider the single-stage modeling framework, and 
this is also a major advantage for the proposed method over other multi- 
step strategies such as blind kriging. 

In the rest of this section, we present how to obtain the f (x) function, 
which is required for the CGP predictor. As shown in (14), the CGP model 
can be decomposed into a global and a local component, and this structure 
provides us a convenient way to assess the change of local volatility. For a 
given global trend (15) (initially we can set S = 1), its squared residuals 

= (sf , . . . , s^)""" are natural measures of the local volatility, which can be 
used as the bases to build the v{x) function. Based on s^, we propose an 
intuitive Gaussian kernel regression model for v{x) as 



where gb(x) = (s-fe(x-xi), . . . ,5rfe(x-x„))'r with gb{h) = exp(-6Xlj=i ^j^j)- 
Here 6 are the correlation parameters used in the global trend (15), b € [0, 1] 
is an extra bandwidth parameter such that gf,(x) — ?• 1 as 6 — >■ 0, and gb{^) = 
g(x) if 6 = 1. Since g(x) is the correlation of the global trend, the underly- 
ing assumption behind (18) is that whenever two points in the global trend 
are strongly correlated, their variances also tend to be more related. The 
bandwidth parameter b adds additional flexility in controlling the smooth- 
ness of the variance function: when equaling zero, it smoothes out v{x.) to a 
constant function even if the global trend is not flat. 

From the v(x) model in (18), we can evaluate Vi = f (xj) for i = 1, . . . ,n 
and update the matrix S = diag{i)i, . . . , Vn}- Since v{x) and S are the stan- 
dardized local volatilities, we also need to rescale them as 



This standardization makes the diagonal elements of S have unit mean, 
which is essential for keeping the ratio of cj^ to consistent in the global 
trend. By plugging the updated (and standardized) S back into (15), we 
can repeat the above process for a few more times. Usually three or four 
iterations are sufficient to stabilize the volatility estimates. This iterative 
estimation for variance is similar in spirit to the iteratively reweighted least 
squares method in classical regression. 



(18) 




(19) 
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Before concluding this section, we want to emphasize that the estimation 
of v{x) does not need to be separately carried out before fitting the CGP 
model; instead, it can be seamlessly nested as an inner loop in estimating 
the whole model. The u(x) function above is uniquely determined by the 
unknown parameters 9 and b. Since its correlation parameter 6 are always 
paired and synchronized with that of the global trend, inclusion of this 
volatility function f (x) only adds one more parameter b to the whole model. 

4. Estimation. In this section we derive maximum-likelihood estimators 
(MLEs) for the unknown parameters in the CGP model. As suggested at the 
end of previous section, given each set of (A, /U, r^, 0, a, 6) values, f(x) and 
Xl = diagl-Oi, . . . ,Vn} values can be uniquely determined by nesting a small 
inner loop in the likelihood function. 

Based on the multivariate normal assumptions in Section 3.2, the log- 
likelihood function (up to an additive constant) can be written as 

T'^,a'^,6, ct, b) 

= -ilog(det(r2G + a25]i/2LSi/2)) 

Due to the invariant property of MLE under transformations, we can repa- 
rameterize A = u^/r^ in the log-likelihood as 

/(A, fi, cx, b) 

(20) = -i[nlog(r2) + log(det(G + X-E^^^L-S^/^)) 

+ (y - /.1)"^(G + ASi/2LSV2)-i(y _ ^i)/r% 

Since X) = diagjui, . . . ,Vn} can be known through the procedures presented 
in the last section, the MLEs for fi and can be easily derived from (20) 
as 

/i(A, 6, a, b) 

(21) 

= (l"^(G + AS1/2lsV2)-1i)-1(iT(g + A5]V2L5]i/2)-iy), 
f'^{X,e,cx,b) 

(22) 

= -(y - /il)T(G + A5]V2L5]i/2)-i(y _ ^i). 

n 

After substituting these values into (20), we can obtain the MLEs for {X,6, 
cx,b) by minimizing the following (negative) log profile likelihood 

(23) 4>{X, e, a, b) = nlog(f2(A, 6>, a, b)) + log(det(G + AS^/^lS^/^))^ 

where AG [0, 1], 6 € [0, 1], 9j G [0, a'] and aj G [a', oo] for j = 1, . . . ,p. 
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For p input variables, the above likelihood function contains 2p + 2 un- 
known parameters. Compared to the stationary GP model whose likelihood 
contains only p unknown parameters, the CGP model becomes more dif- 
ficult to estimate when the input dimension p gets large. To mitigate this 
disadvantage, we can further assume 

(24) aj = ej + K, j = l,...,p. 

Now the CGP contains only p + 3 unknown parameters {\,0,K,b), whose 
MLEs can be obtained by minimizing 

(25) (t>{X,e, K,b) = nlog{f^{X,e, i^,b)) +log{det{G + Xi:^/^^^^/^)), 

subject to the constraints AG [0, € [0, 1], 9j € [0, a'] and k G [a',oo] for 
j = l,...,p. 

We now provide a general guideline for choosing the bound a'. The idea 
is to specify the value of based on the space-filling properties of the 
design points. Suppose the design D = {xi, . . . ,x„} has been standardized 
into the unit region of [0,1]^, and then define the following harmonic-type 
average inter-point distance (iavg to measure its space-filling properties [Ba 
and Joseph (2011)] 



1 \ ~V2 



\n{n-l) ^ d(xi,Xfc)2 

^ ^ ' l<i<k<n ^ '^^ 



where (i(xj,Xfc) = \/{Yl^j=i{^ij ~ ^kjY)- When we assume Oj = 9 and Oj 



a 



(j = 1, . . . ,p) in the Gaussian correlation functions (7), correlations between 
points with distance davg are exp(— 6'davg) and exp(— ad^vg) fo^ the global 
and local processes, respectively. Because exp(— ad^vg) < exp(— a'd^vg) ^ 
exp(— ^d^vg), we want to choose the bound a' to restrict the correlation in the 
local process to be small while ensuring that the correlation in the global pro- 
cess is not too small. Although the choice is not unique, our empirical study 
suggests that a good choice is to set exp(— a'd^vg) = 0.01, which leads to 



(26) a 



I _ log 100 

avg 



This bound is used for estimation throughout the paper. 
5. Properties. 



5.1. Improved prediction for sparse data sets. As discussed in Section 1, 
the ordinary kriging predictor tends to revert to the global mean in re- 
gions where data are not available. This erratic phenomenon will be even 
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Fig. 2. PZof o//jinrfion j/(a;) = sin(30(a:- 0.9)*) cos(2(2:- 0.9)) + (a:-0.9)/2, the global 
trend and the CGP predictor. 

more pronounced if the design points are sparse and cannot cover the in- 
put region reasonably weh. The new predictor, however, relaxes the constant 
mean restriction in ordinary kriging and introduces another GP for modeling 
the mean. This global trend (mean model) is noninterpolating but smooth, 
which makes it immune to the erratic reversion problem in the data sparse 
region. Consider again the simple test function in Figure 1, where the ordi- 
nary kriging predictor {9 = 469.37) appears to be erratic. We fitted the CGP 

model (A = 0.07,^ = 143.6, a = 1892.1,6 = 1) and its global trend is shown 
as a dotted line in Figure 2. Although it incurs large errors around data 
points in region x € [0,0.4], it behaves well in the sparse region [0.4, 1] due 
to the smoothness property. The final CGP predictor after incorporating 
the local trend is shown as a dashed line in Figure 2. It can be seen that 
this predictor eliminates all the noninterpolating errors at design points. At 
locations far from data points, it tends to revert to the smooth global trend 
instead of a global constant, which avoids the erratic problem as in Figure 1 
and yields much improved prediction. This shows the advantage of using the 
CGP predictor when data points are sparse in some parts of the design re- 
gion. In practice, the sparseness of data points is quite common when input 
dimensions are high or a nonspace-filling design is used. 

5.2. Numerical stability. One well-documented problem with the GP 
model is the potential numerical instability when computing the inverse of 
its n X n correlation matrix R. This correlation matrix can easily become ill- 
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conditioned, for example, when sample size n is large, design points are close 
to each other, or the sample points get highly correlated while we search for 
the optimal correlation parameters [Ababou, Bagtzoglou and Wood (1994), 
Haaland and Qian (2011), Peng and Wu (2012)]. A near-singular correlation 
matrix in kriging will lead to serious numerical problems, which causes the 
resulting predictor to be unstable and unreliable. 

To overcome this ill-conditioned problem, the popular approach is to add 
a nonzero nugget to the diagonal elements of the correlation matrix such 
that R (R + AI). Because including a nonzero nugget has the inevitable 
drawback of making predictors over-smooth (noninterpolating), in this ap- 
proach we need to reconcile the gains in numerical stability with the losses in 
interpolation property and choose a trade-off value for the nugget [Ranjan, 
Haynes and Karsten (2011), Peng and Wu (2012)]. 

As shown at the end of Section 3.1, the correlation matrix to invert in 
the proposed CGP model is (G-I-AL). (Cases after including the variance 
matrix S remain similar.) Since the lower bounds for a in (26) are mod- 
erately large and we have L w I, the A in (G -|- AL) automatically inflates 
the diagonal elements of the correlation matrix so that it is naturally re- 
sistent to becoming singular. In addition, different from the previous nugget 
case, the CGP model is always an interpolator and the A value here can be 
freely estimated. In fact, whenever a traditional CP model has to include a 
nonzero nugget for numerical reasons, the CGP model can always improve it 
at least by removing its noninterpolating errors with a augmented Ziocal(x). 
This potential improvement is shown in the next subsection. 

5.3. Connection with the nugget predictor. To emulate deterministic out- 
puts from computer experiments, Gramacy and Lee (2012) advocate always 
including a nonzero nugget in the kriging predictor for reasons even beyond 
computations. They argue that when model assumptions are violated or 
data points are sparse, the traditional CP predictor may lead to unpleas- 
ant results. Although adding a nonzero nugget to the predictor incurs extra 
errors around data points, it can be crucial for fitting a well-behaved (i.e., 
smooth) surface and avoiding erratic predictions in the unknown region. 
In a variety of situations, Gramacy and Lee (2012) show that overall this 
noninterpolating predictor can achieve better prediction accuracy. 

Interestingly, when the local process in CGP has zero correlation (L = I), 
its global trend just degenerates to a kriging predictor with nugget, and in 
this case the CGP predictor becomes y(x) = ynugget(x) +yiocai(x). In regions 
away from design points, since l(x) = and yiocai(x) = for x 7^ Xj (i = 
1, 2, . . . , n), the CGP model exactly matches the nugget predictor ynugget(x). 
At the n design points, however, due to l(x) = e,, for x = Xj (i = 1, 2, . . . , n), 
the 

yiocai(x) still corrects the global trend and adjusts the CGP to interpolate 
all the data points. Just as the universal kriging generalizes the polynomial 
regression for interpolation, the CGP model can be similarly viewed as a 
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(a) (b) 




Fig. 3. Plot of function y(x) = sin(107ra;)/(2x) + [x — 1)^ with (a) the ordinary kriging 
predictor; (b) the knging with nugget predictor; (c) the nugget predictor with adjustments 
around design points; (d) the optimized CGP predictor and its global trend. 

generalization/improvement of the nugget predictor which eliminates errors 
at design points. When correlations in the local process of CGP are further 
estimated as positive, the above adjustments around data points tend to be 
continuous and smooth, which leads to a final CGP predictor inheriting the 
advantages from both the nugget predictor and the interpolating predictor. 

Figure 3(a) demonstrates a simulated example from Gramacy and Lee 
(2012), where the test function y{x) = sin(107rx)/(2x) + (x — 1)^ is evalu- 
ated at 20 unequally spaced locations to represent the sparseness of data 
points. Clearly, we can see that in this example the ordinary kriging predic- 
tor (9 = 45.97) makes predictions well outside the range of test function in 
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many regions. The nugget predictor suggested by Gramacy and Lee (2012) 
is shown in Figure 3(b). Although noninterpolating, the nugget predictor 
overah gives smooth and reasonably good predictions, which reduces the root 
mean squared prediction error (RMSPE) from the previous 0.55 to 0.35. Here 

the RMSPE = T.Zl{yi^^) - yi^i)V?^'^ is computed based on iV = 5000 
randomly sampled data points from the design region. Now we further con- 
sider fitting the CGP model to this example. As shown in Figure 3(c), if 
we assume very small correlations in Ziocai(x), the new predictor remains 
almost the same as the nugget predictor within most regions; when it comes 
to around the design points, however, the predictor jumps to interpolate the 
data, which slightly reduces the RMSPE to 0.34. After we also fully estimate 
the correlations in Ziocai(x) and incorporate a variance model. Figure 3(d) 
gives the final CGP predictor (A = 0.019,^ = 2.44, d = 578.09,6 = 1), which 
is smooth and gives a RMSPE as low as 0.25. 



5.4. Improved prediction intervals. Apart from prediction, another fre- 
quently noted drawback of ordinary kriging is the poor coverage of its 
prediction intervals [Yamamoto (2000), Xiong et al. (2007), Gramacy and 
Lee (2012), Joseph and Kang (2011)]. By assuming a constant variance cr^ 
throughout the whole input region, the (1 — a) prediction interval at location 
X for ordinary kriging is given by 

.T/'^^-n-li\2 ^ 1/2 



y 



X) ± - r^(x)R-^r(x) + ^^^^^j 



where 2^/2 is the upper a/2 critical value of the standard normal distri- 
bution. This prediction interval is often too restrictive and inadequate to 
cover some complex underlying surfaces since it fails to take into account 
the change of local variability in the design region. One typical example is 
demonstrated in Figure 4(a), where the test function fluctuates around zero 
with decreasing amplitude. The corresponding prediction intervals from or- 
dinary kriging, however, yield the same variability pattern throughout the 
whole design region, which are obviously too narrow to cover the high volatil- 
ity region in the left part, but also end up unnecessarily wide in the right 
part of the input region where the true function is almost flat. In this sub- 
section, we introduce the prediction intervals for CGP models. By relaxing 
the constant variance restriction, these prediction intervals are self-adjusted 
according to the local variability and can be expected to give much improved 
coverage. 

In a Bayesian framework, the assumptions for a CGP model in (11) can 
be viewed as putting a prior distribution y(x)|/i ~ GP(/i, r^(7(-) -|- cr^(x)/(-)) 
on the function, which leads to the first-stage conditional distribution 
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Fig. 4. Plot of function y{x) = exp( — 2a:) sin(47rx^) and the prediction intervals from (a) 
ordinary kriging; (b) the CGP model. 



where A = a^/r^, q(x) = g(x) + \v^I^{^)Y}/\^), Q = G + AS^/^LSi/s 
and all the other notation remains the same as in Section 3.2. Here, for 
simplicity, the variance and correlation parameters are assumed to be known. 
If we further assume a second-stage noninformative prior for oc 1 and 

integrate it out, then the predictive distribution for y(x) can be derived as 

2/(x)iy ~iVi(^0|n(x),Wo|nW), 

where 

Mo|n(x) = A + q^(x)Q-i(y - fil) for /i = (iTq-Ii)"! (l^Q-iy) 

and 

(27) .o%(x) = r^jl + A.(x) - q^(x)Q-iq(x) + 

The derivation for these results is tedious but standard, which follows similar 
development steps as in Santner, Williams and Notz (2003), Section 4.3. It 
can be seen that our previously proposed predictor in (13) is nothing but the 
posterior mean of the function given the data. Now a (pointwise) prediction 
interval for this predictor can be constructed by 

(28) y{yL)±Za/2VQ\n{y^), 

where Za/2 is the upper a/2 critical value of the standard normal distribu- 
tion. 

Note that, since q~'"(xj)Q~-'^ = and e^q(xj) = 1 + Av(xj), the above 
posterior variance Uq|^(x) equals zero whenever x = Xj for i = 1, . . . ,n. Thus, 
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as in ordinary kriging, the width of the prediction interval shrinks to zero at 
each data point, which is quite intuitive since both models interpolate the 
responses at each observed location. On the other hand, however, different 
from ordinary kriging, the variance of predictive distribution in (27) depends 
on the local variability of the underlying surface, which intrinsically adjusts 
the widths of the prediction interval. Consider again the test function in 
Figure 4. It can be seen in Figure 4(b) that the prediction intervals from a 
CGP model {9 = 2.1, a = 54.85, A = 1, 6 = 1) become much wider in the left 
region when the function fluctuates rapidly, but quickly narrow down as the 
underlying function becomes flat. Compared with the prediction intervals 
for ordinary kriging, the new intervals can more precisely demonstrate the 
change of prediction uncertainties throughout the input region, that is, the 
predictive variances are much larger in the left part of region than in the 
right. One way to quantify such improvements is through computing the 
interval score for central prediction intervals [Gneiting and Raftery (2007)] 
which is defined as S'^^{l,u;x) = {u-l) + ^{l - x)l{x <l} + ^{x-u)l{x > 
u} for a (1 — a)% central prediction interval [l,u]. This scoring rule (to be 
minimized) rewards narrow prediction intervals and also penalizes lack of 
coverage. For the prediction intervals in Figure 4, the average interval score 
(based on 3000 randomly sampled test points) for the ordinary kriging in 
(a) is 0.62 while for the CGP model in (b) is only 0.32, which shows almost 
50% improvement. 

5.5. Extensions to noisy data. In the previous sections we model the 
deterministic outputs from a computer experiment by coupling two CPs. As 
an extension to this, sometimes it is also possible to use the sum of more 
than two CPs for gaining additional flexibility in the model and satisfying 
special needs. One important application of this extension is to modify the 
new predictor for modeling data with random errors. 

Based on the previous model form in Section 3.2, we can add a third CP 
(with zero correlation) to account for the white noise as follows: 

y(x) = Zgiobal(x) + 0-(x)Ziocal(x) + e(x), 

where .^global (x), Ziocai(x) are the same stationary CPs as in (11), and the 
error term e(x) is assumed to be A^(0,cj^(x)) distributed, uncorrelated at 
different input locations and also independent of the other two CPs. Suppose 
the error variances = diagjcTg (xi), . . . ,a'^{-Kn)} are given, then the best 
linear unbiased predictor can be easily updated by modifying (13) as follows: 

y(x) = A + (r2g(x) + a\i/2(x)sV2i(^))T(^2G ^ ^25.i/2ls1/2 + ^^y^ 

X (y- Ai) 

= A + (g(x) + A^;i/2(x)5]i/2l(x))^(G + ASi/^LS^/^ + pS,)-i(y - /II), 
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where p = l/r^, /i = (1^(0 + AS^^lE^^ ^ p5]^)-ii)-i(iT(G + XT}/^ x 
■|^^i/2 _j_ ^5]^^-iy) and all the other notation remains the same as in (13). 
This predictor for noisy data is no longer an interpolator, and its parameter 
estimation can be similarly carried out as in the previous sections, except 
for (G + AS;^/2l5]^/2) replaced by (G + AS^/^lS^/^ ^ ^5^^) -^^ ^he models. 

6. Examples. 

Example 1. For any nonstationary modeling approach, one commonly 
raised concern is that if the true surface is indeed a realization from a sta- 
tionary Gaussian process, whether the "unnecessarily sophisticated" nonsta- 
tionary modeling approach can perform as good as the "correct" stationary 
model. To test the performance of our proposed model in such cases, we sim- 
ulate sample paths from various two-dimensional stationary Gaussian pro- 
cesses 50 times and fit both the CGP and the stationary GP models to each 
of them for comparison. A 24-run maximin distance Latin Hypercube Design 
(LHD) is used in these simulations and for each time the true correlation 
parameters in GP are randomly generated from [1,5]. In each iteration, once 
the design and correlation parameters are fixed, a 24 x 24 correlation matrix 
R is uniquely determined. A sample path from the corresponding station- 
ary GP can then be drawn by simulating a random sample vector from the 
multivariate normal distribution Nn{^V^, cr^R") with n = 24, // = 0, o"^ = 1. 

After drawing stationary sample paths as above 50 times, we fit CGP 
models to each of them. Among the 50 fitted models, 42 out of them have 
A = 0, which shows that the CGP has perfectly degenerated to the stationary 
GP model. For the other eight CGP models, their A values are also extremely 
small, with the largest one only as 0.003. Measured by the leave-one-out 
cross-validation error, the prediction accuracy of the CGP model and the 
stationary GP model are almost identical in these cases. 

Example 2. In this example, we provide two test functions possessing 
nonstationary features: one in two dimensions and the other in 10 dimen- 
sions. The first function is f{xi,X2) = sin(l/(rEi2;2)) {xi,X2 G [0.3, 1]), whose 
surface fiuctuates rapidly when xi or X2 is small, but gradually becomes 
smooth as xi and X2 increase toward one. The second test function (known 
as Michalewicz's function) has the following form: 

10 r / • ,2 \ n 2m 

, < < vr, i = 1, . . . , 10. 



/(x) = -^sin(xi) sinf— 



Typically, this function is used with m = 10, which leads to a high-dimen- 
sional surface containing many local optima, and its volatility varies dra- 
matically throughout the input region. 

We use a 24-run maximin distance LHD and a 24-run adaptive design 
from Xiong et al. (2007) to evaluate the first test function. Both the GP and 
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Table 1 




RMSPE 


values for the two-dimensional function 


in Example 2 


Method 


Maximin LHD 


Adaptive design 


CP 


0.188 


0.266 


CGP 


0.144 


0.159 


TCP 


0.312 


0.465 



CGP models are fitted to these two designs, and their RMSPEs are compared 
based on additional 5000 randomly sampled testing data. From the results in 
Table 1, we can see that the CGP predictor improves the accuracy of the GP 
predictor by 23% and 40% for each design. Table 1 also shows the results of 
fitting the Bayesian treed Gaussian process (TCP) model [Gramacy and Lee 
(2008)]. The RMSPEs of this nonstationary treed model are relatively large, 
which probably are due to its inefficient partitioning of the input region. 

To further test the performance of the CGP predictor based on different 
designs, we generate fifty 100-run random LHDs to evaluate the second test 
function and fit the GP and CGP models to each of them. RMSPEs of the 
two predictors are plotted in Figure 5 for the 50 random designs. It can 
be seen that, compared to the GP model, the CGP predictor can always 
give better approximations to this complex surface. The RMSPEs of the 




0.60 0.65 0.70 0.75 0.80 0.85 0.90 

RMSPE of CGP 

Fig. 5. RMSPEs of GP and CGP models for Michalewicz's function in Example 2. 
Points falling above the diagonal line indicate larger prediction errors for the GP model. 



20 



S. BA AND V. R. JOSEPH 



two predictors based on a 100-run maximin distance LHD are also marked 
in this plot. 

Example 3. Qian et al. (2006) described a computer simulation of a 
heat exchanger for electronic cooling applications. The device under study 
consists of linear cellular materials and is used for dissipating the heat gen- 
erated by some sources such as a microprocessor. The response of interest is 
the total rate of steady state heat transfer of the device, which depends on 
the mass flow rate of entry air rh G (0.00055, 0.001), the temperature of entry 
air Tin & (270,303.15), the solid material thermal conductivity k G (330,400) 
and the temperature of the heat source T„aii G (202.4, 360). The device is as- 
sumed to have fixed overall width (W), depth (D) and height (H) of 9, 25 and 
17.4 millimeters, respectively. In Qian et al. (2006), the study involved two 
types of simulators: an expensive finite element simulator and a relatively 
cheaper finite difference simulator. Since the latter type of simulation was 
systematically conducted in the design space while the previous one was 
only available at limited locations, here we only focus on using the finite 
difference simulation results to compare the prediction accuracy of several 
different models. Because the four input variables are in very different scales, 
all of them are standardized into the (0, 1) region before analysis. 

Qian et al. (2006) used a 64-run orthogonal array-based Latin Hypercube 
design for running the finite difference simulations with an extra 14-run test 
data set for assessing the predictions from the surrogate model. If no prior in- 
formation is available for the function and an ordinary kriging with Gaussian 
correlation function is directly fitted, the maximum likelihood estimates for 
its correlation parameters are (0.22, 4.37, 0.14, 7.24), which yield a RMSPE 
of 5.15. However, for this particular problem, the physical domain knowledge 
indicates that a linear component is very likely to exist between the response 
and factors. As a result, Qian et al. (2006) included the linear trend into 
the model and fitted a universal kriging to the data. Their results showed 
that the linear effects for Tin and T„aii are significant but for the other two 
variables are almost negligible. By including these two linear effects into the 
global trend, the RMSPE can be successfully reduced to only 2.588. Now 
we fit a CGP model to the data for comparison. Based on the maximum 
likelihood method in Section 4, we can estimate the unknown parameters 
as = (0.008, 0.3, 0.01, 11.74), a = (11.81, 12.17, 11.94, 23.48), A = 0.019 and 
6 = 1. The RMSPE for this new predictor is 2.24, which is much better than 
the ordinary kriging and even smaller than the previous improved result 
from universal kriging. Note that in the global trend of this new predictor, 
the two correlation parameters 02 and 6*4 (for Tin and T^aii) are remarkably 
larger than the others, which perfectly coincides with the two significant 
linear trends in universal kriging. This demonstrates the effectiveness of 
the CGP model for capturing the global trend. In most common situations 
where no functional relationship in the global trend can be known in ad- 
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vance, the ability to automatically estimate the trend and the variance is a 
great advantage for the new predictor over the other methods. 

7. Conclusions. In this article we present an intuitive approach for ap- 
proximating complex surfaces that are not second-order stationary. The new 
predictor intrinsically incorporates a global trend and a flexible variance 
model, and all of its parameters can be estimated in a single stage. Compared 
with many existing methods, the new model enjoys several advantages such 
as numerical stability, improved prediction accuracy and flexible prediction 
intervals. An R package CGP for fitting the CGP model can be downloaded 
from http : //www . cran. r-project . org/. 

For modeling the nonstationarity in variance, one reviewer draws our at- 
tention to a related idea called scaling in the geostatistical literature [Baner- 
jee, Charlin and Gelfand (2003)]. The scaling approach is given in the form 
y(x) =(t(x)Z(x), where Z{x) denotes a stationary process and o"^(x) is a 
variance function that needs to be specified. By choosing (T^(x) as the expo- 
nent of another Gaussian process, Huang et al. (2011) proposed a stochastic 
heteroscedastic process (SHP) model y(x) = g^(x)/3 + o-exp(ra(x)/2)Z(x) 
for low-dimensional environmental applications, where a{x.) is defined to be 
another stationary Gaussian process that is independent of Z{x). Although 
this SHP model does not have a flexible global trend, its variance model is 
more sophisticated than our CGP model. This additional flexibility in vari- 
ance, however, comes with the expenses of a very difficult and complicated 
estimation procedure. Since the likelihood function of the SHP model has 
no closed-form expression, simulation-based approximations have to be ap- 
plied for the likelihood value during each step of its optimization. Obviously, 
this can be computationally very challenging (or even infeasible) when the 
dimension of unknown parameters is high, which limits its application in 
computer experiments. 

Recently, we also noticed an interesting work from Haaland and Qian 
(2011), which uses the sum of multiple CPs to emulate outputs from large 
scale computer experiments. However, the purposes of their work is different 
from ours. The aim of Haaland and Qian (2011) is mainly to control the 
numerical error in computing interpolators based on a huge amount of data. 
Their multiple CP models are fitted sequentially and each of them is only 
based on a subset of data points. On the contrary, our method is developed 
to improve the precision in modeling expensive simulation results that are 
not second-order stationary. Both our global and local CPs are fitted based 
on the entire data set and all parameters in our model are also estimated in 
a single stage. 

For p input factors, the proposed CGP model involves p + 3 unknown 
parameters, which is computationally slightly more expensive to fit than the 
ordinary kriging. This is the price we need to pay for incorporating the extra 
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flexility in modeling the global trend and the change of variance. We want 
to note that although the number of parameters in ordinary kriging can also 
be extended from p to 2p by generalizing its Gaussian correlation function 
to the power exponential correlation function r(h) = exp(— ^^^-^ ) 
or even a Matern correlation function, this extension alone cannot solve the 
problems discussed in this paper, since the resulting predictor still remains 
second-order stationary. 

APPENDIX: PROOF OF THEOREM 1 

Since both the single-stage predictor (14) and the sequential predictor 
(17) contain the same global trend ygiobai(x) as in (15), we only need to 
prove yiocai(x) = 7;^/^(x)yadj (x): 

z;V2(x)y^dj(x) 

= 7;i/2(x)lT(x)L-i5]-i/2[y _ ^1 _ G(G + AS^/^LS^/^^-i^y _ ^j^j 
= i;V2(x)iT(^)L-iS-V2[i _ G(G + ASi/2LsV2)-i](y _ ^i) 
= At;i/2(x)r(x)I]i/2(A5]i/2LSi/2)-i 

X [I - G(G + ASi/2LSi/2)-i](y _ /ii) 
=W A^;i/2(^)iT(^)5.i/2(G ^ ASV2LI]V2)-i(y _ /,i) 

= yiocal(x), 

where the equality holds because (AS^/^lE^/^^-ijj _ q(q _^ )^-^i/2 ^ 
LS1/2)-1](g + AI]1/2ls1/2)=I. 
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